
Flextron: Many-in-One Flexible Large Language Model

Cai, Ruisi; Muralidharan, Saurav; Heinrich, Greg; Yin, Hongxu; Wang, Zhangyang; Kautz, Jan; Molchanov, Pavlo (August 2024, https://doi.org/10.48550/arXiv.2406.10260)

Training modern large language models (LLMs) is extremely resource-intensive, and repeatedly customizing them for deployment scenarios with limited compute and memory is impractical. This paper introduces Flextron, a network architecture and post-training model optimization framework that supports flexible model deployment. Flextron uses a nested elastic structure that adapts rapidly to user-defined latency and accuracy targets at inference time, with no additional fine-tuning. It is also input-adaptive, automatically routing tokens through its sub-networks for improved efficiency and performance. The authors propose a sample-efficient training method and associated routing algorithms that systematically transform an already-trained LLM into a Flextron model. Evaluation on the GPT-3 and LLaMA-2 model families shows that Flextron outperforms multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes only 7.63% of the tokens used in the original pretraining.
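The nested elastic idea in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a single feed-forward layer in which a sub-network of width k reuses the first k hidden units of the full layer, and it assumes that cost scales linearly with width. The names make_mlp, forward, and pick_width are hypothetical, and input-adaptive token routing is left out.

```python
import random


def make_mlp(d_model, d_hidden, seed=0):
    """Create random weights for a toy two-layer feed-forward block (illustrative only)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]
    return w_in, w_out


def forward(x, w_in, w_out, width):
    """Run the block using only the first `width` hidden units.

    Every smaller sub-network is a prefix of the larger one, so all widths
    share one set of weights (the nesting assumption of this sketch).
    """
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(w_in[j], x)))  # ReLU activation
        for j in range(width)
    ]
    return [
        sum(w_out[i][j] * hidden[j] for j in range(width))
        for i in range(len(w_out))
    ]


def pick_width(widths, budget_fraction, full_width):
    """Pick the largest width whose relative cost fits the budget.

    Assumption: cost is proportional to width. Falls back to the smallest
    width when nothing fits.
    """
    feasible = [w for w in widths if w / full_width <= budget_fraction]
    return max(feasible) if feasible else min(widths)


if __name__ == "__main__":
    w_in, w_out = make_mlp(d_model=3, d_hidden=8)
    x = [0.5, -1.0, 2.0]
    width = pick_width([2, 4, 8], budget_fraction=0.5, full_width=8)
    print("chosen width:", width)
    print("output:", forward(x, w_in, w_out, width))
```

Because each width is a prefix of the full layer, changing the budget only changes how many hidden units are read; no weights are copied or retrained in this toy setting.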

